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Abstract 

Information-maximization clustering learns a probabilistic classifier in an unsuper- 
vised manner so that mutual information between feature vectors and cluster as- 
signments is maximized. A notable advantage of this approach is that it only in- 
volves continuous optimization of model parameters, which is substantially easier 
to solve than discrete optimization of cluster assignments. However, existing meth- 
ods still involve non-convex optimization problems, and therefore finding a good 
local optimal solution is not straightforward in practice. In this paper, we propose 
an alternative information-maximization clustering method based on a squared-loss 
variant of mutual information. This novel approach gives a clustering solution an- 
alytically in a computationally efficient way via kernel eigenvalue decomposition. 
Furthermore, we provide a practical model selection procedure that allows us to 
objectively optimize tuning parameters included in the kernel function. Through 
experiments, we demonstrate the usefulness of the proposed approach. 

Keywords 

Clustering, Information Maximization, Squared-Loss Mutual Information. 



1 Introduction 



The goal of clusteri ng is to classify d ata samples into disjoint groups in an unsupervised 
manner. K-means (iMacQueenl . Il967| ) is a classic but still popular clustering algorithm. 
However, since k-means only produces linearly separated clusters, its usefulness is rather 
limited in practice. 
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To cope with this problem, va rious non-linear clustering methods have been developed. 
Kernel k-means dGirolamll2002h performs k-means in a feature space ind uced by a repro- 
ducing kernel function (jScholkopf and Smolal . 120021 ) . Spectral clustering (IShi and Malik! . 



2000 



Ng et all l2002h first unfolds non-linear data manifolds by a spectral embed- 
ding method, and then performs k-means in the embedd ed space. Blurring mean-shift 



4> 

(IFukunaga and Hostetlerl . 119751 ; iCarreira-Perpinanl . 120061 ) uses a non-parametric kernel 



density estimator for modeling the data-generating probability density, and finds clusters 
based on the modes of the estimated density. Discriminative clustering learns a discrim- 
inative classifier for separa t ing clusters, where class labels ar e also treated as parameters 
to be optimized (IXu et all 120051 ; iBach and Harchaouil . 120081 ) . Dependence-maximization 
clusteri ng determines clu s ter assignments so that their depend ence on input data is max- 
imized (ISong et all 120071 ; iFaivishevsky and Goldbergerl . 120101 ) . 

These non-linear clustering techniques would be capable of handling highly complex 
real-world data. However, they suffer from lack of objective model selection strategies^. 
More specifically, the above non-linear clustering methods contain tuning parameters 
such as the width of Gaussian functions and the number of nearest neighbors in kernel 
functions or similarity measures, and these tuning parameter values need to be manually 
determined in an unsupervi sed manner. The problem of learning similarities /kernels was 



addressed in earl i er wo rks (IMeila and Sh.il . |2001| ; IShental et all 120031 ; ICour et all 12005 



Bach and Jordan! . 12006), but they considered sup e rvised setups, i.e., labeled samples are 



assumed to be given. IZelnik-Manor and Peronal (120051 ) provided a useful unsupervised 
heuristic to determine the similarity in a data-dependent way. However, it still requires 
the number of nearest neighbors to be determined manually (although the magic number 
'7' was shown to work well in their experiments). 

Another line of clustering framework called information-maxim i zation clustering ex - 



20061 : iGomes et all Ex- 



hibited the state-of-the-art performance (lAgakov and Barberj . 
In this information- maximization approach, p robabilistic classifiers such as a kernelized 
Gaussian classifier (lAgakov and Barberl . 120061 ) and a kernel logistic regression classifier 
(IGomes et all |2010| ) are learned so that mutual information (MI) between feature vectors 
and cluster assignments is maximized in an unsupervised manner. A notable advantage of 
this approach is that classifier training is formulated as continuous optimization problems, 
which are substantially simpler than discrete optimization of cluster assignments. Indeed, 
classifier training can be carried ou t in computationally efficie nt manners by a gr adient 
method (lAgakov and Barberl. 120061) o r a quasi- Newton method (IGomes et alll2010l ). Fur- 
thermore, lAgakov and Barberl (120061 ) provided a model selection strategy based on the 
information-maximization principle. Thus, kernel parameters can be systematically opti- 
mized in an unsupervised way. 

However, in the above Mi-based clustering approach, the optimization problems are 
non-convex, and finding a good local optimal solution is not straightforward in practice. 
The goal of this paper is to overcome this problem by providing a novel information- 
maximization clustering method. More specifically, we propose to employ a variant of 



1 'Model selection' in this paper refers to the choice of tuning parameters in kernel functions or 
similarity measures, not the choice of the number of clusters. 
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MI called squared-loss MI (SMI), and develop a new clustering algorithm whose solution 
can be computed analytically in a computationally efficient way via kernel eigenvalue 
decomposition. Furthermore, for kernel parameter optimiza tion, we propose t o use a 



2P 

non-parametric SMI estimator called least-squares MI (LSMI: iSuzuki et all . 12009). which 
was proved to achieve the optimal convergence rate with analytic-form solutions. Through 
experiments on various real-world datasets such as images, natural languages, accelero- 
metric sensors, and speech, we demonstrate the usefulness of the proposed clustering 
method. 

The rest of this paper is structured as follows. In Section 121 we describe our proposed 
information-maximization clustering method based on SMI. Then the proposed method 
is compared with existing clustering methods qualitatively in Section [3] and quantitatively 
in Section HI Finally, this paper is concluded in Section |5j 



2 Information-Maximization Clustering 
Squared-Loss Mutual Information 

In this section, we describe our proposed clustering algorithm. 



with 



2.1 Formulation of Information-Maximization Clustering 

Suppose we are given (^-dimensional i.i.d. feature vectors of size n, 

{Xi | Xi G M d }™ =1 , 

which are assumed to be drawn independently from a distribution with density p*(x). 
The goal of clustering is to give cluster assignments, 



{ Vi | Vi E {!,..., c}}l 



:1) 



to the feature vectors {a^}" =1 , where c denotes the number of classes. Throughout this 
paper, we assume that c is known. 

In o rder to solve the clu s tering problem, we take the information-maximization ap- 
proach (lAgakov and Barberl . 120061 ; iGomes et al.l . l2010f ). That is, we regard clustering as 
an unsupervised classification problem, and learn the class-posterior probability p*(y\x) 
so that 'information' between feature vector x and class label y is maximize d. 

2007t 



The 



dependence-maximiz ation 



approach 



(jSong et al 



Faivishevsky and Goldbergerl . 120101 . see also Section 13. 7j) is related to, but substantially 
different from the above information-maximization approach. In the dependence- 
maximization approach, cluster assignments {yi}f =1 are directly determined so that their 
dependence on feature vectors {a?j}" =1 is maximized. Thus, the dependence-maximization 
approach intrinsically involves combinatorial optimization with respect to {?/j}™ =1 . On 
the other hand, the information-maximization approach involves continuous optimization 
with respect to the parameter at included in a class-posterior model p(y\x;ct). This 
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continuous optimization of ex. is substantially easier to solve than discrete optimization 

o£{vt}U- 

Another advantage of the information-maximization approach is that it naturally al- 
lows out-of-sample clustering based on the discriminative model p(y\x; ex), i.e., a cluster 
assignment for a new feature vector can be obtained based on the learned discriminative 
model. 



2.2 Squared-Loss Mutual Information 

As an information measure, we adopt squared-loss mutual information (SMI). SMI be- 
tween feature vector x and class label y is defined by 



SMI 



^2p*(x)p*(y) 

y=l 



p*(x)p*(y) 



1 da;. 



where p*(x,y) denotes the joint densit y of x and y, a nd p*(y) is the marginal probability 
of y. SMI is the Pearson divergence (IPearsonl . Il900l ) from p*(x,y) to p*(x)p*(y), while 
the ordinary MI (I Cover and Thomad . 120061 ) , 



MI:= f £>*(a:,j/)log-J 



p*{x,y) 



x)p*{y) 



dx, 



(2) 



is the Kullback-Leibler divergence ( IKullback and Leiblerl . Il95ll ) from p*(x,y) to 
p*(x)p*(y). The Pearson divergence and the Kullback-Leibler divergence both belong 
to the class of Ali-S i lvey-Csiszdr d ivergences (which is also known as f -divergences, see 
Ali and Silveyl . Il966t ICsiszarl . 119671 ). and thus they share similar properties. For example, 
SMI is non- negative and takes zero if and only if x and y are statistically independent, 
as the ordinary MI. 



In the existing infor mation- maximization clustering methods (lAgakov and Barberl . 
20061 : iGomes et alll2010l . see also Section El]), MI is used as the information measure. On 
the other hand, in this paper, we adopt SMI because it allows us to develop a clustering 
algorithm whose solution can be computed analytically in a computationally efficient way 
via kernel eigenvalue decomposition. 



2.3 Clustering by SMI Maximization 

Here, we give a computationally-efficient clustering algorithm based on SMI ([T]). 
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Expanding the squared term in Eq.([T]), we can express SMI as 

2 



SMI= \ f J2p*(x)p*(y) 

lJ y=l 



p*(x)p*(y) 



dx 
1 



y=l 

c 



2 / J2p*(y\ x )p*( x 

J y=l 



p*(x)p*(y) 



dx . 

p*(y) 2 



(3) 



Suppose that the class-prior probability p*(y) is set to a user-specified value ir y for y = 
1, . . . , c, where ir y > and Y^y=i 71 y = 1- Without loss of generality, we assume that 
{ir y }y =1 are sorted in the ascending order: 

7Tl < ■ • • < 7T C . 

If is unknown, we may merely adopt the uniform class-prior distribution: 



P*(y) = - for ?/ = 1, . . . ,c, 



(4) 



which will be non-informative and thus allow us to avoid biasing clustering solutions^. 
Substituting n y into p*(y), we can express Eq.flS]) as 



1 f x C r 1 1 

2 / — P*(2/|^)P*(^)P*(2/|^)d£E — - 

2/=l y 



(5) 



Let us approximate the class-posterior probability p*(y\x) by the following kernel 
model: 



p{y\x; a) := a Vji K(x, x t ), 



(6) 



where a = (0:1,1, . . . , a c ,n) T is the parameter vector, T denotes the transpose, and K(x, x') 
denotes a kernel function with a kernel parameter t. In the experiments , we will use a 
sparse variant of the local- scaling kernel ( jZehiik-Manor and Peronal . 120051 ): 



K\Xi) Xj) 




\Xj Xq 



1(JiO~j 



if Xi E N t (xj) or Xj e J\f t (xi), 



otherwise, 



(7) 



2 Such a cluster-balance constraint is often e mployed in existing clustering algorithms (e.g., 
Shi and Malikl . l2000t IXu et all . l2005t iNiu et all . 1201 lh . 
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where Mt{x) denotes the set of t nearest neighbors for x (t is the kernel parameter), <Ji is 
a local scaling factor defined clS O % — *^ 

and xf is the t-th nearest neighbor of 

Xi . 

Further approximating the expectation with respect to p*(x) included in Eq.(jSD by 
the empirical average of samples {a^}™ =1 , we arrive at the following SMI approximator: 



SMI :=—Y—alK 2 cx 



1 



2n^ix v y y 2' y J 

y =i y 

where ot y := (at V) i, • • • , a^,n) T and K is j := K(Xi, Xj). 

For each cluster y, we maximize ctyK 2 a y undeiH \\oc y \\ = 1. Since this is the 
Rayleiqh Quotient, the ma ximizer is given by the normalized principal eigenvector of K 
(IHorn and Johnson! Il985l ). To avoid all the solutions {oc y }y =1 to be reduced to the same 
principal eigenvector, we impose their mutual orthogonality: ot^cxy' = for y ^ y'. Then 
the solutions are given by the normalized eigenvectors <f>i, . . . ,<f> c associated with the 
eigenvalues Ai > • • • > A„ > of K. Since the sign of 4> y is arbitrary, we set the sign as 

4>y = ct> y x sign(0jl n ), 

where sign(-) denotes the sign of a scalar and l n denotes the n-dimensional vector with 
all ones. 

On the other hand, since 



|cc, : ; oc) = cxlKl rn 



/l 
P*(y\x)p*(x)dx w - 22p(y\'. 
n i=i 

and the class-prior probability p*(y) was set to TT y for y = 1, . . . , c, we have the following 
normalization condition: 

OLyKl n = TT y . 

Furthermore, probability estimates should be non-negative, which can be achieved by 
rounding up negative outputs to zero. 

Taking these normalization and non-negativity issues into account, cluster assignment 
yi for Xi is determined as the maximizer of the approximation of p(y\xi): 

[max(O n , K(j}y)\i 7r 3/ [max(O n , 4> y )]i 

y-i = argmax = argmax 



y n y 1 max(O n ,K(f) y ) T l n y max(O n ,0 y ) T l n 

where the max operation for vectors is applied in the element-wise manner and [•], denotes 
the z-th element of a vector. Note that we used K(f> y = \ y 4> y in the above derivation. For 
out-of-sample prediction, cluster assignment y' for new sample x' may be obtained as 



TT y max (o, Ya=i k ( x ', x i) [4> y 



l , , . 'yn 

y := argmax 

y Aj / max(O n ,0 y ) T l J 



We call the above method SMI-based clustering (SMIC). 



3 Note that this unit-norm constraint is not essential since the obtained solution is renormalizcd later. 
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2.4 Kernel Parameter Choice by SMI Maximization 

The solution of SMIC depends on the choice of the kernel parameter t included in the ker- 
nel function K(x, x'). Since SMIC was developed in the framework of SMI maximization, 
it would be natural to determine the kernel parameter t so as to maximize SMI. A direct 
approach is to use the SMI estimator SMI (jSJ) also for kernel parameter choice. However, 
this direct approach is not favorable because SMI is an unsupervised SMI estimator (i.e., 
SMI is estimated only from unlabeled samples {cEj}" =1 ). On the other hand, in the model 
selection stage, we have already obtained labeled samples {{xi,yi)}2 =l , and thus super- 
vised estimation of SMI is possible. For supervised SMI esti mation, a non-param etric 



SMI estimator called least-squares mutual information (LSMI; ISuzuki et all 120091 ) was 
shown to achieve the optimal convergence rate. For this reason, we propose to use LSMI 
for model selection, instead of SMI (jSJ) . 

LSMI is an estimator of SMI based on paired samples {(cCj, yi)}™ =l - The key idea of 
LSMI is to learn the following density-ratio function, 

1 ,yh p*{x) P *{yY [) 

without going through density estimation of p*(x,y), p*(x), and p*(y). More specifically, 
let us employ the following density-ratio model: 

r{x,y;0):= £ fl/Lfoaj/), (10) 

l:y t =y 

where 6 = (6\, . . . , 6 n ) T and L(x, x') is a kernel function with a kernel parameter 7. In 
the experiments, we will use the Gaussian kernel: 

L(a,aO=exp( J |a ~f 112 ), (11) 

where the Gaussian width 7 is the kernel parameter. 

The parameter 6 in the above density-ratio model is learned so that the following 
squared error is minimized: 



mm 
2 



\ I J2(r(x,y;6) -r*(x,y)y P *(x)p*(y)dx. (12) 

J y=l 



Let By be the parameter vector corresponding to the kernel bases {L(x,Xe)}e :ye = y , i.e., 
6 y is the sub- vector of = (9\, ... , 9 n ) T consisting of indices {£ | yi — y}. Let n y be the 
length of y , i.e., the number of samples in cluster y. Then an empirical and regularized 
version of the optimization problem ffl2|) is given for each y as follows: 



mm 

da 



(13) 



s 



where 6 (> 0) is the regularization parameter. is the n y x n y matrix and is the 



n y - dimensional vector defined as 



1=1 



n z — ' 

where cc^ is the £-th sample in class y (which corresponds to #j[ ). 

A notable advantage of LSMI is that the solution 6™' can be computed analytically 

as 

§(y) = (H (y) + 5I)- 1 h {y) . 
Then a density-ratio estimator is obtained analytically as follows^: 

fly 

0(V) T Inn rj3l)\ 



r(x,y) = ^2fy y 'L(x,x 



£=1 



The accuracy of the above least-squares density-ratio estimator depends on the choice 
of the k ernel parameter 7 included in L(x,x') and the regularization parameter 5 in 



Eq. (fl3l) . [Suzuki et al.l (120091 ) showed that these tuning parameter values can be systemat- 
ically optimized based on cross-validation as follows: First, the samples Z = {(xi,yi)}^ =1 
are divided into M disjoint subsets {Z m }^ =1 of approximately the same size (we use 
M = 5 in the experiments). Then a density-ratio estimator r m (x,y) is obtained using 
Z\Z m (i.e., all samples without Z m ), and its out-of-sample error (which corresponds to 
Eq. fTl2l without irrelevant constant) for the hold-out samples Z m is computed as 

CV m := -— ; ? <n{x iy ) 2 - — — Y r m (x,y). 

x,yeZ m (x,y)£Z m 

This procedure is repeated for m = 1, . . . , M, and the average of the above hold-out error 
over all m is computed as 



CV: = 



1 M 



M 

m=l 



Note that, in the original LSMI paper (jSuzuki et al. , 2009), the entire parameter 6 = {9\ 



for all classes was optimized at once. On the other hand, we found that, when the density-ratio model 
r(x,y;0) defined by Eq. (fT0| is used for SMI approximation, exactly the same solution as the original 
LSMI paper can be computed more efficiently by class-wise optimization. Indeed, in our preliminary 
experiments, we confirmed that our class-wise optimization significantly reduces the computation time 
compared with the original all-class optimization, with the same solution. Note that the original LSMI 
is applicable to more general setups such as regression, multi-label classification, and structured-output 
prediction. Thus, our speedup was brought by focusing on classification scenarios where Kronecker's 
delta function is used as the kernel for class labels in the density-ratio model ([To]) . 
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Input: Feature vectors X = {a^}™ =1 and the number c of clusters 
Output: Cluster assignments y = {?/i}™ =1 

For each kernel parameter candidate t G T 
<— SMIC(^,t )C ); 
LSMI(t) < — LSMI(Af,yW); 
end 

t < — argmax LSMI(t); 

Figure 1: Pseudo code of information- maximization clustering based on SMIC and LSMI. 
The kernel parameter t refers to the tuning parameter included in the kernel function 
K(x, x 1 ) in the cluster-posterior model ([6]). Pseudo codes of SMIC and LSMI are described 
in Figure [2] and Figure [31 respectively. 



Finally, the kernel parameter 7 and the regularization parameter 5 that minimize the 
average hold-out error CV are chosen as the most suitable ones. 
Finally, based on an expression of SMI ([Q), 

SMI = -~ I ^Tr*(x,y) 2 p*(x)p*(y)dx + I ^2r*(x,y)p*(x,y)dx - -, 

J y=l J y=l 

an SMI estimator called LSMI is given as follows: 

n n 

LSMI := " 2^ S ^ + n E ? ^ Vi) - 2 ' ( 14 ) 

where r(x,y) is a density-ratio estimator obtained above. Since r(x,y) can be computed 
analytically, LSMI can also be computed analytically. 

We use LSMI for model selection of SMIC. More specifically, we compute LSMI as a 
function of the kernel parameter t of K(x, x') included in the cluster-posterior model fl6]), 
and choose the one that maximizes LSMI. A pseudo code of the entire SMI-maximization 
clustering procedure is summarized in Figures [IH5J Its MATLAB implementation is avail- 
able from 



http : / / sugiyama-www . cs . titech . ac . j p/~ sugi/ sof tware/SMIC 



3 Existing Clustering Methods 

In this section, we review existing clustering methods and qualitatively discuss the relation 
to the proposed approach. 
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Input: Feature vectors X = {xi}f =l , kernel parameter t, 

and the number c of clusters 
Output: Cluster assignments y = {yi}™ =1 

K < — Kernel matrix for samples X and kernel parameter t; 

4> y < — y-th principal eigenvectors of K for y — 1, . . . , c; 

4> y < — 4> y x sign(0jl„) for y = 1, . . . , c; 

[max(O n , . 
y.j < — argmax ^— - for i — 1, . . . , n; 

!/e{i,...,c} max(O n , y ) T l n 



Figure 2: Pseudo code of SMIC (with the uniform class-prior distribution). The kernel 
parameter t refers to the tuning parameter included in the kernel function K(x, x') in the 
cluster-posterior model (jSJ). If the class-prior probability p*(y) is set to a user-specified 
value 7T y for y = 1, . . . , c, yi is determined as argmax EeIe^s^^JL _ 



Input: Feature vectors X = {a3.;}™ =1 and cluster assignments 3^ = {Vi}i=i 
Output: SMI estimate LSMI 

Z < — {{xi,yi)}? =1 ; 

{Z m }^ =1 < — M disjoint subsets of Z\ 
For each kernel parameter candidate 7 G T 

For each regularization parameter candidate 5 G A 
For each fold m — 1, . . . , M 

r lt s t m(x,y) < — Density ratio estimator for (7, 5) using Z\Z m ; 
CV m (7,<5) < — Hold-out error of r 7) s >m (x, y) for Z m \ 
end 

1 M 

CV( 7 ,5)^--]TcV m (7,5); 

m=l 

end 
end 

(7, 5) < — argmin CV(7, 5); 
7er,<5eA 

r~(x, y) < — Density ratio estimator for (7, 5) using Z; 

1 n 1 n 1 

LSMI < «Za E ^ %') 2 + " E ^ V*) - 9" 



Figure 3: Pseudo code of LSMI. The kernel parameter 7 refers to the tuning parameter 
included in the kernel function L(x,x') in the density-ratio model ffTU]) . 
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3.1 K-Means Clustering 

K-means clustering ( MacQueenl . 1967 ) would be one of the most popular clustering algo- 



rithms. It tries to minimize the following distortion measure with respect to the cluster 
assignments {yi}f =1 : 



EE 

y=l i:yi=y 



ft, I 



(15) 



where \x y := — Yli- y = y x i * s the centroid of cluster y and n y is the number of samples in 
cluster y. 

The original k-m eans algorithm is capable of only producing linearly separated clusters 
(IDuda et all 1200 ll ). However, since samples are used only in terms of their inner products, 
its non-linear variant can be immediately obtai ned by performi ng k-means in a feature 
space induced by a reproducing kernel function (iGirolamil . 120021 1 . 



As the optimization problem of (kernel) k-means is NP-hard (lAloise et all [2009J) 



a greedy optimization algorithm is usually used for finding a local optimal solution in 
practice. It was shown that the solution to a continuously-relaxed variant of the kernel 



k-means problem is given by the principal components of the kernel matrix (IZha et al. 



20021 ; iDing and Hel . 12004] ). Thus, post-discretization of the relaxed solution may give a 
good approximation to the original problem, which is computationally efficient. This idea 
is similar to the proposed SMIC method described in Section 12.31 However, an essential 
difference is that SMIC handles the continuous solution directly as a parameter estimate 
of the class-posterior model. 

The performance of kernel k-means depends heavily on the choice of kernel functions, 
and there is no systematic way to determine the kernel function. This is a critical weakness 
of kernel k-means in practice. On the other hand, our proposed approach offers a natural 
model selection strategy, which is a significant advantage over kernel k-means. 



3.2 Spectral Clustering 



The basic idea of spectral clustering ( jShi and Malikl . |2000| ; iNg et all |2002| ) is to first unfold 
non-linear data manifolds by a spectral embedding method, and then perform k-means 
in the embedded space. More specifically, given sample-sample similarity Wij > (large 
Wij means that cCj and Xj are similar), the minimizer of the following criterion with 
respect to {^j}™ =1 is obtained under some normalization constraint: 



1,3 



where D is the diagonal matrix with i-th diagonal element given by D^i := Y^™ = 
Consequently, the embedded samples are given by the principal eigenvectors of 
D 2\VD~z , followed by normalization. Note that spectral clustering was shown to be 
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equiv alent to a weighted variant of kernel k-means with some specific kernel (IDhillon et al. 
2004h . 

The perfor mance of spectral clustering depen ds heavily on the choice of sample-sample 
similarity Wij. IZelnik-Manor and Peronal (120051 ) proposed a useful unsupervised heuristic 
to determine the similarity in a data-dependent manner, called local scaling: 



W itj = exp 

where <Tj is a local scaling factor defined as 



\Xi X j 



Mi 



and xf is the t-th nearest neighbor of £Cj. t is the tuning parameter in the local scaling 
simil arity, and t = 7 was shown to be useful (IZelnik-Manor and Peronal . l2005t ISugiyamal . 
20071 ). However, this magic number '7' does not seem to work always well in general. 

If D —2 WD 2 is regarded as a kernel matrix, spectral clustering will be similar to 
the proposed SMIC method described in Section 12.31 However, SMIC does not require 
the post k-means processing since the principal components have clear interpretation as 
parameter estimates of the class-posterior model ([6]). Furthermore, our proposed approach 
provides a systematic model selection strategy, which is a notable advantage over spectral 
clustering. 



3.3 Blurring Mean-Shift Clustering 



Blurring mean-shift ( IFukunaga and Hostetlerl . 1 19 751 1 is a non-parametric clustering 



method based on the modes of the data-generating probability density. 



In the blurring mean-shift algorithm, a kernel density estimator ([Silverman! Il986l ) is 
used for modeling the data-generating probability density: 



1 n 

8=1 



where K(£) is a kernel function such as a Gaussian kernel K(£) = e~^ 2 . Taking the 
derivative of p(x) with respect to x and equating the derivative at x = Xi to zero, we 
obtain the following updating formula for sample Xi (i — 1, . . . , n): 



X j 



where := K' {\\xi — Xj\\ 2 /o~ 2 ^ and K'(£) is the derivative of K(£). Each mode of the 
density is regarded as a representative of a cluster, and each data point is assigned to the 
clu ster which it conve r ges to . 



Carreira-Perpinanl (120071 ) showed that the blurring mean-shift alg o rithm can be 



interpreted as an expectation-maximization algorithm ( lDempster et all 119771 ). where 
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Wi,i/(X)y=i Wij') is regarded as the posterior probability of the z-th sample belonging 
to the j-th cluster. Furthermore, the above update rule can be expressed in a matrix 
form as X < — XP, where X = (xi, . . . , x n ) is a sample matrix a nd P 



WD 1 is a 



stochastic matrix of the random walk in a graph with adjacency W (jChungl . 119971 ) . D is 
defined as D iti := Y^=i an d -Dy = for i ^ j. If P is independen t of X. the above 
iterative algorithm corresponds to the power method ( iGolub and Loan! Il996l ) for finding 
the leading left eigenvector of P. Then, this algorithm is highly related to the spectral 
clustering which computes the princ ipal eigenvectors o f D~ ^W D >~a (see Section [3.21) . 
Although P depends on X in reality, ICarreira-Perpinanl (120061 ) insisted that this analysis 
is still valid since P and X quickly reach a quasi-stable state. 

An attractive property of blurring mean-shift is that the number of clusters is automat- 
ically determined as the number of modes in the probability density estimate. However, 
this choice depends on the kernel parameter a and there is no systematic way to determine 
er, which is restrictive compared with the proposed method. Another critical drawback 
of the blurring mea n-shift algorit hm is that it eventually converges to a single point (i.e., 
a single cluster, see IChengl . Il995l . fo r details), and therefore a sensible stopping criterion 
is necessary in practice. Although ICarreira-Perpinanl (120061 ) gave a useful heuristic for 
stopping the iteration, it is not clear whether this heuristic always works well in practice. 



3.4 Discriminative Clustering 



The support vector machine (SVM; IVapnikl . Il995l ) is a supervised discriminative classifier 
that tries to find a hyperpl ane separating positive and negative samples with the maximum 
margin. IXu et al.l ( 120051 ) extended SVM to unsupervised classification scenarios (i.e., 
clustering), which is called maximum-margin clustering (MMC). 

MMC inherits the idea of SVM and tries to find the cluster assignments y = 
(yi, . . . ,y n ) T so that the margin between two clusters is maximized under proper con- 
straints: 



min max 2A T l n — (K o AA T , yy ) 

ye{+i,-i} n a 

subject to — e < l n y < e and n < X < Cl n , 

where o denotes the Hadamard product (also known as the entry-wise product), and e 
and C are tuning parameters. The constraint — e < l^y < e corresponds to balancing 
the cluster size. 

Since the above optimization problem is combinatorial with respect to y and thus 
hard to solve directly, it is relaxed to a semi-definite program by replacing y y T (which 



is a z ero-one matrix with rank one) with a real positive semi-definite matrix ( IXu et al. 



20051 ). Since then, several app roaches have been developed for further improving the com 



putational efficiency of MMC (IValizadegan and Jinl . 120071 ; IZhao et al.l . 120081 ; IZhang et al. 
20091 ; iLi et all . 120091 : IWang et alll2010l ) 



The performance of MMC depends heavily on the choice of the tuning parameters e 
and C, but there is no systematic method to tune these parameters. The fact that our 
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proposed approach is equipped with a model selection strategy would practically be a 
strong advantage over MMC. 

Following a sim ilar line to MMC, a discri minative and flexible framework for clus- 



tering (DIFFRAC; iBach and Harchaouil . 120081 ) was proposed. DIFFRAC tries to solve 



a regularized least-squares problem with respect to a linear predictor and class labels. 
Thanks to the simple least-squares formulation, the parameters in the linear predictor 
can be optimized analytically, and thus the optimization problem is much simplified. A 
kernelized version of the DIFFRAC optimization problem is given by 

min tT(UU T KT(TKT + nKl n y 1 T), 

U6{+1,-1}« 

where II is the n x c cluster indicator matrix, which takes 1 only at one of the elements 
in each row (this corresponds to the index of the cluster to which the sample belongs) 
and others are all zeros, k (> 0) is the regularization parameter, and T := I n — -1^.1^ 
is a centering matrix. In practice, the above optimization problem is relaxed to a semi- 
definite program by replacing IIII T with a real positive semi-definite matrix. However, 
DIFFRAC is still computationally expensive and it suffers from lack of objective model 
selection strategies. 

3.5 Generative Clustering 



In the generative clustering framework (IDuda et all 120011 ). class labels are determined by 



y — argmax p*(y\x) = argmax p*(x,y), 
y y 

where p*(y\x) is the class-posterior probability and p*(x,y) is the data-generating prob- 
ability. Typically, p*(x,y) is modeled as 

p(x, y; (3, 7r) = p(x\y; f3)p(y; tt), 

where (3 and tt are parameters. Canonical model choice is the Gaussian distribution for 
p(x\y; (3) and the multinomial distribution for p(y; tt). 

However, since class labels {?/j}™ =1 are unknown, one may not directly learn /3 and tt 
in the joint-probability model p(x, y; (3, tt). An approach to coping with this problem is 
to consider a marginal model, 

c 

p(x; (3, tt) = ^2p(x\y; (3)p(y; tt), 

y=l 



and learns the parameters (3 and 7r by maximum likelihood estimation ( IDuda et al.Ll200l[ ): 



n 

maxTTp(a; i ;/3, 7r). 

i = l 
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Since the likel ihood function of the above mixture model is non-convex, a gradient method 
( jAmaril . 119671 ) may be used for finding a local maximizer in practice. For determining 
the number of c lusters (mixtures) and the mixing-element model p(x\y;(3), likelihood 
cross-validation (IHardle et all 12004 ) may be used. 

Another approach to coping with the unavailability of class labels is to regard 
as laten t vari ables, and apply the expectation-maximization (EM) algorithm 



(IDempster et all 119771 ) for finding a local maximizer of the joint likelihood: 



maxTTp(a; i ,?/ i ;/3,7r). 

/3,7T 

1=1 

A more flex i ble va riant of the EM algorithm called the split- and-merge EM algorithm 
flUeda et all l2000h is also available, which dynamically controls the number of clusters 
during the EM iteration. 

Instead of point-estimating the p arameters (3 an d 7r, one can also consider their distri- 
butions in the Bayesian framework (jBishopl . 120061 ). Let us introduce prior distributions 
p(/3) and p(tt) for the parameters f3 and 7r. Then the posterior distribution of the pa- 
rameters is expressed as 

p{(3,iz\X) <xp{X\Mp{p)p{iz), 
where X = {cCj}" =1 . Based on the Bayesian predictive distribution, 

p{y\x,X)(x J J p(aj,y|/3,7r)p(/3,7r|Ar)d0(hr, 



class labels are determined as 



maxp{y\x, X). 
y 



Because the integration included in the Bayesian predictive distribution is compu- 
tationally expensive, conjugate priors are often adopted in practice. For example, for 
the Gaussian-cluster model p(x\y;(3), the Gaussian prior for the mean parameter and 
the Wishart prior is assumed for the precision parameter (i.e., the inverse covariance) 
are assumed; the Dirichlet prior is assumed for the multinomial model p(y; 7r ). Other- 
wise, the posterior distribution is approximated by the Laplace appro ximation ( iMacKayl . 
20031 ) , the Mar kov cha i n Monte Carlo sampling (lAndrieu et all 120031 ) , or the variational 
approximation (lAttiasl . |2000| ; iGhahramani and Beall . |2000| ). The number of clusters can 
be determined based on the maximization of the marginal likelihood: 



p(X) 



argmax 

y 



P (X\(3,n)p(l3)p(7r)df3dTr. 



(16) 



The generative clustering methods are statistically well-founded. However, density 
models for each cluster p*(x\y) need to be specified in advance, which lacks flexibility 
in practice. Furthermore, in the Bayesian approach, the choice of cluster models and 
prior distributions are often limited to conjugate pairs in practice. On the other hand, 
in the frequentist approach, only local solutions can be obtained in practice due to the 
non-convexity caused by mixture modeling. 
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3.6 Posterior-Maximization Clustering 

Another possible clustering approach based on probabilistic inference is t o directly max- 
imizes the posterior probability of class labels y = {?/i}™ =1 ( iBishopl . 120061 ): 



maxp*(y\X). 

Let us model the cluster-wise data distribution p*(X\y) by p(X\y,(3). 

An approximate inference method called iterative conditional modes 



( IKurihara and Welling . 120091 ) alternatively maximizes the posterior probabilities of 
y and (3 until convergence: 



y 



p(y\x,p), 



When the Gaussian model with covariance identity is assumed for p(y\X,(3), this algo- 
rithm is reduced to the k-means algorithm (see Section 13. ip under the uniform priors. 

Let us consider the class-prior probability p*(y) and model it by p(y\ir). Introducing 
the prior distributions p(f3) and p(ir), we can approximate the posterior distribution of 
y as 

p(y\X) oc J J p(X\y,[3)p(f3)p(y\7r)p(7r)d[3dir. 

Similarly to generative clustering described in Section 13. 5[ conjugate priors such as the 
Gauss- Wishart prior and the Dirichlet prior are practically useful in improving the compu- 
tational efficiency The number of clusters can also be similarly determined by maximizing 
the marginal likelihood ( !T6|) . However, direct optimization of 3^ is often computationally 
intractable due to c n combinations, where c is the number of clusters and n is the number 
of samples. For this reason, efficient sampling schemes such as the Markov chain Monte 
Carlo are indispensable in this ap proach. 



A Dirichlet process mixture ([Ferguson! Il973t lAntoniakl . 119741 ) is a non-parametric 
extension of the above approach, where an infinite number of clusters are implicitly con- 
sidered and the number of clusters is automatically determined based on observed data. 
In order to improve the computational efficiency of this infinite mixture approach, vari- 
ous approximation schemes such as Markov chain M onte Carlo sampling (INeall . 120001 ) and 
variational approximation ( jBlei and Jordan! . 120061 ) have been introduc ed. Furthermore , 
variants of Dirichlet process es such as hierar c hical Dirichlet processes (jTeh et all 120071 ). 
nested Dirichlet processes ( [Rodriguez et al.l . 120081 ). and dependent Dirichlet processes 
( ILin et al.l . |2010[ ) have been developed recently. 

However, even in this non-parametric Bayesian approach, density models for each 
cluster still need to be parametrically specified in advance, which is often limited to 
Gaussian models. This highly limits the flexibility in practice. 
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3.7 Dependence-Maximization Clustering 



The HUbert- Schmidt independence criterion (HSIC; iGretton et all 120051) is a dependence 



measu re based on a reproducing kernel function K(x, x') ( lAronszajnl . 119501 ) . ISong et al. 



( 120071 ) proposed a dependence-maximization clustering method called clustering with 
HSIC (CLUHSIC), which tries to determine cluster assignments {yi}f=i so that their 
dependence on feature vectors {a?j}^ =1 is maximized. 

More specifically, CLUHSIC tries to find the cluster indicator matrix IT (see Sec- 
tion (331) that maximizes 



where Ka 



tr(KnAn T ), 

and A is a c x c cluster-cluster similarity matrix. Note that 



nAII T can be regarded as the kernel matrix for cluster assignments. ISong et al.l (120071 ) 
used a greed y algorithm to opti mize the cluster indicator matrix, which is computationally 
demanding. lYang et al.l ( 120101 ) gave spectral and semi-definite relaxation techniques to 
improve the computational efficiency of CLUHSIC. 

HSIC is a kernel-based independence measure and the kernel function K(x, x') needs 
to be determined in advance. However, there is no systematic model selection strategy 
for HSIC, and using the Gaussian kernel wi th width set to the me dian distance between 
samples is a standard heuristic in practice (IFukumizu et al.l . 120091 ) . On the other hand, 



our proposed approach is equipped with an objective model selection strategy, which is a 
notable advantage over CLUHSIC. 

Another line of dependence-maximization clustering adopts mutual information (MI) 
as a dependency measure. Recently, a dependence-maximiz ation clustering method called 
mean nearest-neighbor (MNN) clustering was proposed ( Faivishevsky and Goldbergerl . 
20101 ). MNN is based on the fc-nearest-neighbor entropy estimator proposed by 



Kozachenko and Leonenkol ( 1987 ). 

The performance of the original fc-nearest-neighbor entropy estimator depends on the 
choice of the number of nearest neighbors, k. On the other hand, MNN avoids this 
problem by introducing a heuristic of taking an average over all possible k. The resulting 
objective function is given by 



y=l 



n„ 



\Xi Xj 



+ e), 



(17) 



where e (> 0) is a smoothing parameter. Then this objective function is minimized with 
respect to cluster assignments {yi}f =1 using a greedy algorithm. 

Although the fact that the tuning parameter k is averaged out is convenient, this 
heuristic is not well justified theoretically. Moreover, the choice of the smoothing param- 
eter e is arbitrary. In the MATLAB code provided by one of the authors, e = 1/n was 
recommended, but there seems no justification for this choice. Also, due to the greedy op- 
timization scheme, MNN is computationally expensive. On the other hand, our proposed 
approach offers a well-justified model selection strategy, and the SMI-based clustering 
gives an analytic-form solution which can be computed efficiently 
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3.8 Information-Maximization Clustering with Mutual Infor- 
mation 

Finally, we review methods o f information- maximizatio n clustering based on mutual infor- 
mation ( Agakov and Barber . 2006 ; Gomes et al. . 2010l ). which belong to the same family 
of clustering algorithms as our proposed method. 

Mutual information (MI) is defined and expressed as 



MI 



2_^p (x,y)\og — ; — da; 



y=l 

c 



' p*(x)p*(y) 



^2 p*(y\x)p*(x) log p*{y\x)dx - / ^2p*(y\x)p*(x) logp*(y)dx 



y=l 



y=l 



Let us approximate the class-posterior probability p*(y\x) by a conditional-probability 
model p(y\x; ex) with parameter ex. Then the marginal probability p*(y) can be approxi- 
mated as 



p*{y) 



p 



l - 

(y\x)p*(x)dx - 22p(y\xi] ex). 
i=i 



(19) 



By further approximating the expectation with respect to p*(x) included in Eq. (|T8l) by 
the empirical average of samples jxj}" the following MI estimator can be obtained 
( Agakov and Barber . 2006 ; Gomes et al. . 2010l ): 

j n c 

MI := - y SyS2,p{y\x i ; ol) log p(y\xt; ex) 

i=l y=l 



(20) 



y=X 



1=1 



3=1 



In lAgakov and Barber! (120061 ). the Gaussian model, 



p(y\x; ex) oc exp 



\x — c 



24 +b " 



(or its kernelized version) is adopted, where ex = {c y ,s y , b y } y=1 is the parameter. Then a 

local maximizer of MI with respect to the parameter ex is found by a gradient method. 
On the other hand, in iGomes et al.l (120101 ). the logistic model 



p(y\x;cx) oc exp (cx y x) , 



(21) 



(or its kernelized version) is adopted, where ex = {cx y }y =1 is the parameter. Then a local 
maximizer of MI with respect to the parameter ex is found by a quasi-Newton method. 
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Finally, cluster assignments {yi}2=i are determined as 

jji = axgmaxp[y\xi;a), 
y 

where a is a local maximizer of MI. Below, we refer to the above method as Mi-based 
clustering (MIC). 

In the kernelized version of MIC, the user needs to determine parameters included 
in the kernel funct i on suc h as the kernel width or the number of nearest neighbors. 
Agakov and Barberl (120061 ) proposed to choose the kernel parameters so that MI (120]) 
is maximized. Thus, cluster assignments and kernel parameters can be consistently de- 
termined under the common guidance of maximizing MI. However, since MI is an unsu- 
pervised estimator of MI, it is not accurately enough; in the model selection stage, clus- 
ter labels {yi}i=i are available and thus supervised estimation of MI is more favorable. 
Indeed, ther e exists a more pow erful supervised MI estimator called maximum-likelihood 
MI (MLMI; ISuzuki et al.l . 120081 ). which was proved to achieve the optimal non-parametric 
convergence rate. 

The derivation of MLMI follows a similar line to LSMI explained in Section 12^ i.e., 
the density-ratio function ([9]) is learned. More specifically, the following density-ratio 
model r(x, y; 6) is used: 



r(x,y;0) := ^ 6 e L(x,x t ), 



t--y%=y 



where 6 = (8\, . . . ,8 n ) T and L(x,x') is a kernel function with a kernel parameter 7. 
Then the parameter is learned so that the Kullback-Leibler divergence from p*(x,y) 
to r(x,y; 0)p*(x)p*(y) is minimizecfl An empirical version of the MLMI optimization 
problem is given as 



max 
9 



1 

- y2logr(xi,yi;0) 

i=l 

1 - 

zz22 r ( x i>y* ) = 1 



and 6 > r 



where n denotes the n-dimensional vector with all zeros and the inequality for vectors 
is applied in the element-wise manner. This is a convex optimization problem, and thus 
the global optimal solution 0, which tends to be sp arse, can be easily obtained by, e.g., a 
projected gradient method ( jSugiyama et al.[ 120081 ). 

Then an MI estimator called MLMI is given as follows: 



1 n 

MLMI := -Vlogrte.y^e). 



The kernel parameter 7 included in the k ernel function L(x , x') can be optimized by 
cross-validation, in the same way as LSMI (ISuzuki et al.l . 120081 ) . 



Note that r(x,y;6)p*(x)p*(y) can be regarded as a model of p*(x,y). 
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4 Experiments 

In this section, we experimentally evaluate the performance of the proposed and existing 
clustering methods. 

4.1 Illustration 

First, we illustrate the behavior of the proposed method using the following 4 artificial 
datasets with dimensionality d = 2 and sample size n = 200: 

(a) Four Gaussian blobs: For the number of classes c = 4, samples in each class are 

drawn from the Gaussian distributions with mean (2,2) T , (— 2,2) T , (2, — 2) T , and 
(—2, — 2) T and covariance matrix 0.25/2, respectively. 

(b) Circle & Gaussian: For c = 2, samples in one class are drawn from the 2- 

dimensional standard normal distribution, and samples in the other class are equi- 
distantly located on the origin-centered circle with radius 5. Then noise following 
the origin-centered normal distribution with covariance matrix O.OU2 is added to 
each sample. 

(c) Double spirals: For c = 2, the i-th sample in one class is given by 

{ti cos(mj), l{ sin(mj)) T , and the i-th sample in the other class is given by 
(— £i cos(mj), — ii sin(mj)) T , where l{ — 1 + 4(i — l)/n and = 3ii(i — l)/n. Then 
noise following the origin-centered normal distribution with covariance matrix O.OU2 
is added to each sample. 

(d) High & low densities: For c = 2, samples in one class are drawn from the 2- 

dimensional standard normal distribution, and samples in the other class are drawn 
from the 2-dimensional origin-centered normal distribution with covariance matrix 
0.01J 2 . 

The class-prior probability was set to be uniform. The generated samples were centralized 
and their variance was normalized in the dimension-wise manner (see the top row of 
Figure H]). A MATLAB code for generating these samples are available from 

'http : / / sugiyama-www . cs . titech . ac . j p/~ sugi/ sof tware/SMIC '. 

As a kernel function, we used the sparse local-scaling kernel ([7]) for SMIC, where the 
kernel parameter t was chosen fronfl{l,...,10} based on LSMI with the Gaussian kernel 

(HU). 

The top graphs in Figure H] depict the cluster assignments obtained by SMIC with 
the uniform class-prior, and the bottom graphs in Figure H] depict the model selection 
curves obtained by LSMI (i.e., the values of LSMI as functions of the model param- 
eter t). The clustering performance was evaluated by the adjusted Rand index (ARI; 



6 We confirmed that t larger than 10 was not chosen in this experiment. 
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Figure 4: Illustrative examples. Cluster assignments obtained by SMIC (top) and model 
selection curves obtained by LSMI (bottom). 
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Figure 5: Illustrative examples. Cluster assignments obtained by MIC (top) and model 
selection curves obtained by MLMI (bottom). 
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Hubert and Arabia Il985l ) between inferred cluster assignments and the ground truth cat- 
egories (see Appendix for the details of ARI). Larger ARI values mean better performance, 
and ARI takes its maximum value 1 when two sets of cluster assignments are identical. 
The results show that SMIC combined with LSMI works well for these toy datasets. 

Figure |5] depicts the cluster assignments and model selection curves obtained by MIC 
with MLMI (see Section f3 . 8 j) . where pre-training of the kern el logistic model using the 



clust er assignments obtained by self-tuning s pectral cluster i ng (jZelnik-Manor and Perona 



20051 ) was carried out for initializing MIC ( jGomes et al.l . 120101 ). The figure shows that 



qualitatively good clustering results were obtained for the datasets (a) and (b). However, 
for the datasets (c) and (d), poor results were obtained due to local optima of the objective 
function Q2QJ. 

Figure |6]and Figure [7] depict class-posterior probabilities estimated by SMIC and MIC, 
respectively. The plots show that, for the datasets (a), (b), and (c) where the clusters are 
clearly separated, the estimated class-posterior probabilities are almost zero-one functions 
and thus the class prediction is highly certain. On the other hand, for the dataset (d) 
where the two clusters are overlapped, the estimated class-posterior probabilities tend to 
take intermediate class-posterior probabilities. 

4.2 Influence of Imbalanced Class-Prior Probabilities 

Next, we experimentally investigate how imbalanced class-prior probabilities (i.e., the 
sample size in each cluster is significantly different) influence the clustering performance 
of SMIC. 

We continue using the 4 artificial datasets used in Section 14.14 but we set the true 
class-prior probability as 

p*(y = 1) = p*{y = 2) = 0.1, 0.15, 0.2, 0.25, 

l-p*{y = l)-p*(y = 2) 



p*( y = 3)=p*(y = A) 
for the dataset (a), and 

p*( y = 1) = 0.2,0.3,0.4,0.5, 
P *(y = 2) = l-p*(y = l), 

for the datasets (b)-(d). The following 2 approaches are compared: 
SMIC: SMIC with the uniform class-prior probabilities 7i"i = 712 = 1/2. 

SMIC*: SMIC with the true class-prior probabilities tx\ = p*(y = 1) and 7r 2 = p*(y = 2). 

The mean and standard deviation of ARI over 100 runs are plotted in Figure[BJ showing 
that the difference between SMIC and SMIC* is negligibly small. Indeed, the two methods 
were judged to be comparable to each other in terms of the average ARI by the t-test at 
the significance level 1% for all tested cases. This implies that SMIC is not sensitive to the 
choice of class-prior probabilities. Thus, in practice, SMIC with the uniform class-prior 
distribution may be used when the true class-prior is unknown. 
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Figure 6: Illustrative examples. Class-posterior probabilities estimated by SMIC. 
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Figure 7: Illustrative examples. Class-posterior probabilities estimated by MIC. 
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Figure 8: Illustrative examples. The mean ARI over 100 runs as functions of the class- 
prior probability p*(y = 1). The two methods were judged to be comparable in terms of 
the average ARI by the t-test at the significance level 1%. 

4.3 Performance Comparison 

Finally, we systematically compare the performance of the proposed and existing clus- 
tering methods using various real-world datasets such as images, natural languages, ac- 
celerometric sensors, and speech. 



4.3.1 Setup 

We compared the performance of the following methods, which all do not contain open 
tuning parameters and therefore experimental results are fair and objective: 



KM: K- means (iMacQueenl . 1 196 71 . see also Section I3TTT) . We used the software included 
in the MATLAB Statistics Toolbox, where initial values were randomly generated 
100 times and the best result in terms of the k-means objective value was chosen as 
the final solution. 



SC: Spectral clustering flShi and MaliM . I200CH: iNg et all 12002. see also Section EOD with 
the self-tuning local-scaling similarity (jZelnik-Manor and Peronal . 120051 ). We used 
the MATLAB code provided by one of the author^, where the post k-means process- 



|http : //webee . technion . ac . il/~lih i/Demos/Self TuningClustering . html 
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ing was repeated 10 times with heuristic initialization: the first center was chosen 
randomly from samples, and then the next center was iteratively set to the farthest 
sample from the previous ones. The best result in terms of the k-means objective 
value out of 10 repetitions was chosen as the final solution. 



MNN: Mean nearest-neighbor clustering ( iFaivishevsky and Goldbergerl . 120101 see also 



Section ET7|) . We used the MATLAB code provided by one of the authors^. Following 
the suggestions provided in the program code, the number of iterations was set to 
10 and the smoothing parameter e (see Eq.f lTTj) ) was set to e = l/n. 

MIC: Mi-based clusterin g with kernel logistic models and the sparse local-scaling kernel 



( IGomes et all |2010| . see also Se ction 13.81). where m odel selection is carried out by 



maximum-likelihood MI (MLMI; ISuzuki et al.l . 120081 ). We implemented this method 



using MATLAB, which is a combination of the MIC code personally provided by 
one of the authors, and the MLMI code available from the web page of one of the 
authors^. Following the suggestion provided in the original program code, MIC was 
initialized by pre-training of the kernel logistic model using the cluster assignments 
obtained by spectral clustering. The tuning parameter t included in the sparse 
local-scaling kernel ([7]) was chosen from {1, . . . , 10} based on MLMI with Gaussian 
kernels (see Section 13. 8p . The Gaussian kernel width in MLMI was chosen from 
{10~ 2 , 10~ 15 , 10 _1 , . . . , 10 2 } based on cross-validation. As suggested in the MLMI 
code provided by the author, the number of kernel bases in MLMI was limited to 
200, which were randomly chosen from all n kernels. 

SMIC: SMI-based clustering with the sparse local-scaling kernel and the uniform class- 
prior distribution (see Section 12.31). w here model selection is carried out by least- 



squares MI (LSML lSuzuki et all 12001 see also Section [23D- We implemented SMIC 
and LSMI using MATLAB by ourselves. The tuning parameter t included in the 
sparse local-scaling kernel (j7|) was chosen from {1, . . . , 10} based on LSMI with 
Gaussian kernels (see Section 12.41) . The Gaussian kernel width and regulariza- 
tion parameter included in LSMI were chosen from {10 -2 , 10" L5 , 10 _1 , . . . , 10 2 } and 
{10 -3 , 10 -2 ' 5 , 10~ 2 , . . . , 10 1 }, respectively, based on cross-validation. Similarly to 
MLMI, the number of kernel bases in LSMI was limited to 200, which were ran- 
domly chosen from all n kernels. 

In addition to the clustering quality in terms of ARI, we also evaluated the computa- 
tional efficiency of each method by the CPU computation time. 

4.3.2 Datasets 

We used the following 6 real-world datasets. 

8 http : //www . levf aivishevsky . webs . com/NIC . rar 



http : / / sugiyama-www . cs . t itech . ac . jp/~sugi/sof tware/MLMI/ index . html 
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Digit (d = 256, n = 5000, and c = 10): The USPS hand-written digit dataseto, which 
contains 9298 digit images. Each image consists of 256 (= 16 x 16) pixels and rep- 
resents a digit in {0, 1, 2, ... , 9}. Each pixel takes a value in [—1, +1] corresponding 
to the intensity level in gray-scale. We randomly chose 500 samples from each of 
the 10 classes, and used 5000 samples in total. 



Face (d = 4096, n = 100, and c = 10): The Olivetti Face dataseO, which contains 400 
gray-scale face images (40 people; 10 images per person). Each image consists of 
4096 (= 64 x 64) pixels and each pixel takes an integer value between and 255 as 
the intensity level. We randomly chose 10 people, and used 100 samples in total. 



Document (d = 50, n = 700, and c = 7): The 20-Newsgroups dataseto, which contains 
20000 newsgroup documents across 20 different newsgroups. We merged the 20 
newsgroups into the following 7 top-level categories: : comp\ 'rec\ 'scz', Halk\ 
l alt\ 'misc\ and 'soc'. Each document is expressed by a 10000 -dimensional baq - 
of-words vector of term-frequencies. Following the convention (jJoachimd . 120021 ). 
we transformed the term-frequency vectors to the term frequency/inverse document 
frequency (TFIDF) vector, i.e., we multiplied the term-frequency by the logarithm 
of the inverse ratio of the documents containing the corresponding word. We ran- 
domly chose 100 samples from each of the 7 cla sses, and used 700 samples in total. 
We applied principal component analysis (PCA; IPearsonl . Il90ll ; Ijolliffel . Il986l ) to the 
700 samples, and extracted 50-dimensional feature vectors. 



Word (d = 50, n = 300, and c = 3): The SENSEVAL-2 dataseOfor word-sense disam- 
biguation. We took the noun 'interest' appeared in 1930 contexts, having 3 different 
meanings: 'advantage, advancement or favor', 'a share in a company or business', 
and 'money paid for the use of money' (i.e., 3 classes). From each surr ounding; 
context, we extracted a 14936-dimensional feature vector (INiu et al.l . 120051 ). which 
includes three types of features: part- of- speech of neighboring words with posi- 
tion information, bag-of-words in the surrounding context, and local collocation 



( jLee and Ngi . |2002| ). We randomly chose 100 samples from each of the 3 classes, 
and used 300 samples in total. We applied PCA to the 300 samples, and extracted 
50-dimensional feature vectors. 



Accelerometry (d = 5,n = 300, and c = 3): The ALKAN dataseto, which contains 3- 
axis (i.e., x-, y-, and z-axes) accelerometric data collected by the iPod touch. In 
the data collection procedure, subjects were asked to perform three specific tasks: 
walking, running, and standing up. The duration of each task was arbitrary, and the 
sampling rate was 20Hz with small variations. Each data-stream was then segmented 



10 http : //www .gaussianprocess . org/gpml/data/ 

11 http : //www . cs . toronto . edu/~r owe is /data. html 



12 http : //people . csail .mit . edu/jrennie/20Newsgroups/ 

13 http : //www . senseval . org/ 

14 



http : / / alkan . mns . kyutech .ac.jp/ web/ data . html 
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i n a sliding window m anner with window width 5 seconds and sliding step 1 second 
( jHachiya et all 1201 ll ). Depending on subjects, the position and orientation of the 
accelerometer was arbitrary — held by hand or kept in a pocket or a bag. For this 
reason, we took the £ 2 - n orm of the 3- dimensional acceleration vector at each time 
step, and computed the following 5 orientation-invariant features from each window: 
mean, standard deviation, fluctuation of amplitude, average e nergy, and frequency- 
domain entropy (jBao and Intilld . 120041 : iBharatula et al.l . 120051 ) . We randomly chose 
100 samples from each of the 3 classes, and used 300 samples in total. 

Speech (d = 50, n = 400, and c = 2): An in-house speech dataset, which contains short 
utterance samples recorded from 2 male subjects speaking in French with sampling 
rate 44.1kHz. From eac h utterance sampl e , we e xtracted a 50-dimensional line spec- 



tral frequencies vector (IKain and Maconl . Il988l ). We randomly chose 200 samples 



from each class, and used 400 samples in total. 



For each dataset, the experiment was repeated 100 times with random choice of sam- 
ples from the database, where the cluster size is balanced. Samples were centralized 
and their variance was normalized in the dimension-wise manner, before feeding them to 
clustering algorithms. 



4.3.3 Results 

The experimental results are described in Table [TJ For the digit dataset, MIC and SMIC 
outperform KM, SC, and MNN in terms of ARI. The entire computation time of SMIC 
including model selection is faster than KM, SC, and MIC, and is comparable to MNN 
which does not include a model selection procedure. For the face dataset, SC, MIC, and 
SMIC are comparable to each other and are better than KM and MNN in terms of ARI. 
For the document and word datasets, SMIC tends to outperform the other methods. For 
the accelerometry dataset, MNN and SMIC work better than the other methods. Finally, 
for the speech dataset, MIC and SMIC work comparably well, and are significantly better 
than KM, SC, and MNN. 

Overall, MIC was shown to work reasonably well, implying that the MLMI-based 
model selection strategy is practically useful. SMIC was shown to work even better than 
MIC, with much less computation time. The accuracy improvement of SMIC over MIC 
was gained by computing the SMIC solution in a closed-form without any heuristic initial- 
ization. The computational efficiency of SMIC was brought by the analytic computation 
of the optimal solution and the class- wise optimization of LSMI (see Section 

The performance of MNN and SC was rather unstable because of the heuristic averag- 
ing of the number of nearest neighbors in MNN and the heuristic choice of local scaling in 
SC. In terms of computation time, they are relatively efficient for small- to medium-sized 
datasets, but they are expensive for the largest dataset, digit. KM was not reliable for the 
document and speech datasets because of the restriction that the cluster boundaries are 
linear. For the digit, face, and document datasets, KM was computationally very expen- 
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Table 1: Experimental results on real- world datasets (with equal cluster size). The average 
clustering accuracy (and its standard deviation in the bracket) in terms of ARI and the 
average CPU computation time in second over 100 runs are described. Larger ARI is 
better, and shorter computation time is preferable. The best method in terms of the 
average ARI and methods judged to be comparable to the best one by the t-test at the 
significance level 1% are described in boldface. Computation time of MIC and SMIC 
corresponds to the time for computing a clustering solution after model selection has 
been carried out. For references, computation time for the entire procedure including 
model selection is described in the square bracket, which depends on the number of 
model candidates (in the current setup, we had 81 (=9x9) candidates. 

Digit (d = 256, n = 5000, and c = 10) 





KM 


sc 


MNN MIC 


SMIC 


ARI 


0.42(0.01) 


0.24(0.02) 


0.44(0.03) 0.63(0.08) 


0.63(0.05) 


Time 


835.9 


973.3 


318.5 ' 84.4(3631.7] 


14 4T359 51 






Face (d = 4096, 


n = 100, and c = 10) 






KM 


SC 


MNN MIC 


SMIC 


ARI 


0.60(0.11) 


0.62(0.11) 


0.47(0.10) 0.64(0.12) 


65(0 11) 


Time 


93.3 


2.1 


1.0 1.4[30.8] 


0[19 31 




Document (d = 50, n — 700, and c = 7) 






KM 


SC 


MNN MIC 


SMIC 


ARI 


0.00(0.00) 


0.09(0.02) 


0.09(0.02) 0.01(0.02) 


0.19(0.03) 


Time 


77.8 


9.7 


6.4 3.4[530.5] 


0.3(115.3] 






Word (d = 50, 


n = 300, and c = 3) 






KM 


SC 


MNN MIC 


SMIC 


ARI 


0.04(0.05) 


0.02(0.01) 


0.02(0.02) 0.04(0.04) 


0.08(0.05) 


Time 


6.5 


5.9 


2.2 1.0(369.6] 


0.2(203.9] 




Accelerometry (d - 


= 5, n — 300, and c = 3) 






KM 


SC 


MNN MIC 


SMIC 


ARI 


0.49(0.04) 


0.58(0.14) 


0.71(0.05) 0.57(0.23) 


0.68(0.12) 


Time 


0.4 


3.3 


1.9 0.8(410.6] 


0.2(92.6] 






Speech (d = 50 


, n = 400, and c = 2) 






KM 


SC 


MNN MIC 


SMIC 


ARI 


0.00(0.00) 


0.00(0.00) 


0.04(0.15) 0.18(0.16) 


0.21(0.25) 


Time 


0.9 


4.2 


1.8 0.7(413.4] 


0.3(179.7] 
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Table 2: Experimental results on real-world datasets under imbalanced setup. ARI values 
are described in the table. Class-imbalance was realized by setting the sample size of the 
first class m times larger than other classes. SMIC was computed with the uniform prior 
(i.e., the non-informative prior). The results for m = 1 are the same as the ones reported 
in Table HJ 

Digit (d = 256, n = 5000, and c = 10) 





KM 


SC 


MNN 


MIC 


SMIC 


777 


= l 


0.42(0.01) 


0.24(0.02) 


0.44(0.03) 


0.63(0.08) 


0.63(0.05) 


777 


= 2 


0.52(0.01) 


0.21(0.02) 


0.43(0.04) 


0.60(0.05) 


W.viOl VJtVJO 1 






Document (d = 


50, n = 700, and c = 7) 








KM 


SC 


MNN 


MIC 


SMIC 


rn 


= 1 


0.00(0.00) 


0.09(0.02) 


0.09(0.02) 


0.01(0.02) 


0.19(0.03) 


m 


= 2 


0.01(0.01) 


0.10(0.03) 


0.10(0.02) 


0.01(0.02) 


0.19(0.04) 


777 


= 3 


0.01(0.01) 


0.10(0.03) 


0.09(0.02) 


-0.01(0.03) 


0.16(0.05) 


777 


= 4 


0.02(0.01) 


0.09(0.03) 


0.08(0.02) 


-0.00(0.04) 


0.14(0.05) 








Word (d = 50 


, n = 300, and c = 3) 








KM 


SC 


MNN 


MIC 


SMIC 


777 


= 1 


0.04(0.05) 


0.02(0.01) 


0.02(0.02) 


0.04(0.04) 


0.08(0.05) 


777 


= 2 


0.00(0.07) 


-0.01(0.01) 


0.01(0.02) 


-0.02(0.05) 


0.03(0.05) 






Accelerometry (d 


= 5, n = 300, and c = 3) 








KM 


sc 


MNN 


MIC 


SMIC 


777 


= 1 


0.49(0.04) 


0.58(0.14) 


0.71(0.05) 


0.57(0.23) 


0.68(0.12) 


777 


= 2 


0.48(0.05) 


0.54(0.14) 


0.58(0.11) 


0.49(0.19) 


0.69(0.16) 


777 


= 3 


0.49(0.05) 


0.47(0.10) 


0.42(0.12) 


0.42(0.14) 


0.66(0.20) 


rn 


= 4 


0.49(0.06) 


0.38(0.11) 


0.31(0.09) 


0.40(0.18) 


0.56(0.22) 



sive since a large number of iterations were needed until convergence to a local optimum 
solution. 

Finally, we performed similar experiments under imbalanced setup, where the sample 
size of the first class was set to be m times larger than other classes with the total 
number of samples fixed to the same numbei0. The results are summarized in Table |2j 
showing that the performance of all methods tends to be degraded as the degree of cluster 
imbalance increases. Thus, clustering becomes more challenging if the cluster size is 
imbalanced. Among the compared methods, the proposed SMIC (with the uniform prior) 
still worked better than other methods. 

Overall, the proposed SMIC combined with LSMI was shown to be a useful alternative 
to existing clustering approaches. 



15 Because of the dataset size, this experiment was carried out only for several cases. See Table [5J 
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5 Conclusions 



In this paper, we proposed a novel information-maximization clustering method that 
learns class-posterior probabilities in an unsupervised manner so that the squared-loss 
mutual information (SMI) between feature vectors and cluster assignments is maximized. 
The proposed algorithm, called SMI-based clustering (SMIC), allows us to obtain cluster- 
ing solutions analytically by solving a kernel eige nvalue problem. Thu s , unli k e the previous 



infor mation-maximization clustering methods (lAgakov and Barber! . 120061 ; iGomes et al. 



201CH ) . SMIC does not suffer from the problem of local optima. Furthermore, we proposed 
to use an optimal non-parametric SMI estimator called least-squares mutual information 
(LSMI) for data-driven parameter optimization. Through experiments, SMIC combined 
with LSMI was demonstrated to compare favorably with existing clustering methods. 

In experiments, the proposed clustering method was shown to be useful for various 
types of data. However, the amount of improvement is large for some datasets, while it is 
mild for other datasets. It is thus practically important to have more insights on in what 
case the proposed method is advantageous. 

The sparse local-scaling kernel (|7j) was shown to be useful in experiments. Since this 
produces a sparse kernel matrix, the computation of SMIC (i.e., solving a kernel eigenvalue 
problem) can be carried out very efficiently However, if model selection is taken into 
account, the proposed clustering procedure is still computationally rather demanding 
due to the repeated computation of LSMI, which requires to solve a system of linear 
equations. In the experiments, we used the Gaussian kernel ( ITT]) for LSMI and found it 
useful in practice. However, it produces a dense kernel matrix and thus a dense system 
of linear equations need to be solved, which is computationally expensive. If a sparse 
kernel is used also for LSMI, its computational efficiency will be highly improved. In our 
preliminary experiments, the use of the sparse local-scaling kernel for LSMI improved the 
computational efficiency, but it did not perform as well as the Gaussian kernel. Thus, our 
important future work is to find a sparse kernel that gives an accurate approximation of 
SMI with high comp utational ef fi ciency . 

As addressed in ISong et al.l (120071 ). kernelized methods can be applied to cluster- 
ing of non-vectorial structured objects such as st rings, trees, a nd graphs by employing 



kerne l functions defined for such str u ctured data (Lodh i et al.l. 120021; iDuffv and Co llins 



2002 




Kashima and Kovanaei 




2002 


Gartner et al.. 


2003[ 


Gartner , 


2003). 



2002; Kashima et al.. 2003 



parameters, the performance of clustering methods without systematic model selection 
strategies depends on subjective parameter tuning, which is not preferable in practice. 
For Gaussian kernels, there exists a p opular heuristic that th e Gaussian width is set to 
the median distance between samples ( jFukumizu et al.l . 120091 ). However, there seems no 
such common heuristic for structured kernels. In such scenarios, the proposed method 
will be highly advantageous because it allows systematic model selection for any kernels. 
We will explore this direction in our future work. 

We experimentally showed that the proposed method with the uniform class-prior 
distribution still works reasonably well even when the true class-prior probability is not 
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uniform. This is a useful property in practice since the true class-prior probability is often 
unknown. Another way to address this issue is to estimate the true class-prior probability 
in a data-driven fashion, for example, iteratively performing clustering and updating the 
class-prior probabilities. We will investigate such an adaptive approach in our future 
work. 

The proposed method uses SMI as the common guidance for clustering, although we 
are using two SMI approximators: SMI defined by Eq.flS]) for finding clustering solutions 
and LSMI defined by Eq. flU]) for selecting models. Since SMI does not explicitly include 
cluster labels {yi}f = i, it has a simple form and therefore is suited for efficient maximization. 
Indeed, we can obtain an optimal solution analytically by solving an eigenvalue problem. 
However, since SMI is an unsupervised estimator where the cluster labels {yi}f = i are not 
used, it may not be accurate enough for model selection purposes. Indeed, our preliminary 
experiments showed that the use of SMI is not appropriate as a model selection criterion. 
On the other hand, since LSMI achieves the optimal non-parametric convergence rate, its 
high accuracy is suitable for model selection purposes. However, LSMI explicitly requires 
cluster labels {?/j}™ =1 and thus is not suited for efficient maximization. Based on the 
optimality of LSMI, we ideally want to use LSMI consistently for both finding clustering 
solutions and selecting models. However, its optimization involves discrete optimization 
of {yi}£=i, which is cumbersome in practice. Our future challenge is to develop a practical 
clustering algorithm based directly on LSMI. 
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Appendix: Rand Index and Adjusted Rand Index 



Here, we review t he definitions of the Ran d index (RI; iRandl . 197l[ ) and the adjusted 



Rand index (ARI; iHubert and Arabid . Il985l ). which are used for evaluating the quality of 



clustering results. Let {y*}^ =1 be the ground-truth cluster assignments, and let {?/i}™ =1 be 
a clustering solution obtained by some algorithm. The goal is to quantitatively evaluate 
the similarity between {yi}^ =1 and {y*}^ =1 . 

The most direct way to evaluate the discrepancy between {yi}f =1 and {y*}™ =1 would be 
to naively verify the correctness of the predicted labels. However, in clustering, predicted 
class labels {yi}f = i do not have to be equal to the true labels {?/*}™ =1 , but only their 
partition matters. The correctness of the partition may be evaluated by verifying the 
correctness of the predicted labels for all possible label permutations. However, this is 
computationally expensive if the number of classes is large. RI and ARI are alternative 
performance measures that can overcome this computational problem in a systematic way. 
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Table 3: Notation for Rand index and adjusted Rand index. 



(a) 



(b) 





C{ ■■■ C* 


Sum 


Ci 


ni,i ■ 




m 


c c 


n c ,i ■ 




n c 


Sum 


* 

n\ ■ 


* 


n 







Pairs in {C*,} c y/=1 






Same 


Different 


Pairs in 


Same 




m C£* 


{Cy}y=l 


Different 







For the two partitions {yi}f=i and {y*}™ =1 , let C y and C* (y = l,...,c) be sets of 
indices of samples in cluster y, respectively: 

C y = {y % | yi = y}, 
C* y = {y* I y* = y}- 

Let riy^i be the number of samples that are assigned to the cluster C y and the cluster C*,. 
Let n y (resp. n*) be the number of samples that are assigned to the cluster C y (resp. C*,). 
The notation is summarized in Table [3](a). 

Let mc,c*, m C c*, m^c*, and m c -c. be defined as 

y,y'=l V J 
™<c,c* := Yl (2) ~ m c£*, 

y=l V / 

m c,c* '■= \ 2) ~ mc > c *> 

where mc£* denotes the number of pairs of samples that are assigned to the same cluster 
both in {Cy} c y=l and {Cp} c y , 

=1; m c,c* denotes the number of pairs of samples that are 
assigned to the same cluster in {Cy) y=l but are assigned to different clusters in {C*,} yl=1 , 
m c,c* denotes the number of pairs of samples that are assigned to the same cluster in 
{C*,}y l=1 but are assigned to different clusters in {C y } y=1 , and m^c* denotes the number 
of pairs of samples that are assigned to different clusters both in {C^}^^ and {C y/ ^ y/ _^. 
mcfi* + m c,c* can be considered as the number of 'agreements' between {C y } y=l and 
{C*,} yl=1 , while m C Q* + mQ C , can be regarded as the number of 'disagreements' between 
{C y }y =l and {C*,} yl=1 . The notation is summarized in Table Mjo)- 



33 



The Rand index (RI; iRandl . Il97ll ) is defined and expressed as 

m C fi* + mc,c* 



RI :-- 



m c>c * + mc,c* + m c,c* + ™c,c* 
(m c ,c* +«Vc*) / 



The Rand index lies between and 1, and takes 1 if the two clustering solutions {C y } y=l 
and {c;,y yl =1 agree with each other perfectly. 

A potential drawback of the Rand index is that its expected value is not a constant 
(say, 0) if two clustering solutions are com pletely random. To overcom e this problem, the 
adjusted Rand index (ARI) was proposed ( Hubert and Arabie . Il985h . ARI is defined as 



ARI :-- 



m c ,c* + mc,c* - V 



™c,c* + m c> c* + m Cfi * + ™c~ e* 
/i is the expected value of m^c + ^c,c* : 

// := E [m c ,c* + mc t c*\ , 



where E denotes the expectation over cluster assignments. ARI takes the maximum value 
1 when two sets of cluster assignments are identical, and takes if the index equals its 
expected value. 

Under the assumption that the clustering solutions {C y }y =1 and {C*,}y, =1 are randomly 
drawn from a generalized hyper-geometric distribution, it holds that 



E [m c ,c*] = {m c ,c* + mc,c*)(mc,c* + m<* c .) / 

E [ m c,c*] = (mc,c* + m c,c*)( m c,c* + m c,c*) / 
Then ARI can be expressed as 



ARI 



E 

y,y'=i 



11 



y,y 



E 

y=l 



11. 



E 

y' = l 



IV 



E 

.»=i 



ii. 



+ E 

y' = l 



n 



y' 



c 

E 

y=l 



n. 



E 

y' = l 



2 



Note that RI and ARI can be defined even when two sets of cluster assignments {?/j}™ =1 
and {y*}2=i have different numbers of clusters, i.e., {C y } y=1 and {C*,}y l=1 with c ^ d . This 
is highly convenient in practice since, when the number of true clusters is large, clustering 
algorithms often produce clustering solutions with a smaller number of clusters (i.e., some 
of the clusters have no samples). Even in such cases, RI and ARI can still be used for 
evaluating the quality of clustering solutions. 
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